October 2020
Why do we want to plot data?
Looking at the data as a first step of analysis is always a good idea
A striking example of this is the “Datasaurus dozen”: a dull and unimpressive dataset.
Two variables, x and y, over 13 different conditions. Import the data (data/DatasaurusDozen.tsv) and compute mean and st.dev. by dataset:

| dataset | mean_x | mean_y |
|---|---|---|
| away | 54.27 | 47.83 |
| bullseye | 54.27 | 47.83 |
| circle | 54.27 | 47.84 |
| dino | 54.26 | 47.83 |
| dots | 54.26 | 47.84 |
| h_lines | 54.26 | 47.83 |
| high_lines | 54.27 | 47.84 |
| slant_down | 54.27 | 47.84 |
| slant_up | 54.27 | 47.83 |
| star | 54.27 | 47.84 |
| v_lines | 54.27 | 47.84 |
| wide_lines | 54.27 | 47.83 |
| x_shape | 54.26 | 47.84 |
But if you plot it, you’ll see stark differences
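The summary and the plot can be reproduced roughly as follows. This is a minimal sketch, assuming the file has the columns dataset, x and y (as the table suggests) and that dplyr, readr and ggplot2 are installed; the helper names summarise_by_dataset() and plot_by_dataset() are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Mean and st.dev. of x and y within each dataset
# (assumed column names: dataset, x, y)
summarise_by_dataset <- function(df) {
  df %>%
    group_by(dataset) %>%
    summarise(
      mean_x = mean(x), mean_y = mean(y),
      sd_x = sd(x), sd_y = sd(y),
      .groups = "drop"
    )
}

# One scatter panel per dataset, so the shapes can be compared
plot_by_dataset <- function(df) {
  ggplot(df, aes(x = x, y = y)) +
    geom_point() +
    facet_wrap(~dataset)
}

path <- "data/DatasaurusDozen.tsv"  # path as given in the text
if (file.exists(path)) {
  dozen <- readr::read_tsv(path)
  print(summarise_by_dataset(dozen))
  print(plot_by_dataset(dozen))
}
```

The summary rows look nearly identical, while the faceted scatter plots do not.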
Plotting allows one to convey a lot of information in a compact way
It is important to make good plots
i.e., plots that look good…
…and are honest to the data
it is very easy to hide the message rather than highlighting it
it is very easy to mislead with a plot
so let’s start with a gallery of bad plots. Can you guess why they are bad?
Examples:
We will start by using the built-in dataset mpg
mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows
skimr::skim(mpg)
| Name | mpg |
| Number of rows | 234 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| manufacturer | 0 | 1 | 4 | 10 | 0 | 15 | 0 |
| model | 0 | 1 | 2 | 22 | 0 | 38 | 0 |
| trans | 0 | 1 | 8 | 10 | 0 | 10 | 0 |
| drv | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
| fl | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
| class | 0 | 1 | 3 | 10 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| displ | 0 | 1 | 3.47 | 1.29 | 1.6 | 2.4 | 3.3 | 4.6 | 7 | ▇▆▆▃▁ |
| year | 0 | 1 | 2003.50 | 4.51 | 1999.0 | 1999.0 | 2003.5 | 2008.0 | 2008 | ▇▁▁▁▇ |
| cyl | 0 | 1 | 5.89 | 1.61 | 4.0 | 4.0 | 6.0 | 8.0 | 8 | ▇▁▇▁▇ |
| cty | 0 | 1 | 16.86 | 4.26 | 9.0 | 14.0 | 17.0 | 19.0 | 35 | ▆▇▃▁▁ |
| hwy | 0 | 1 | 23.44 | 5.95 | 12.0 | 18.0 | 24.0 | 27.0 | 44 | ▅▅▇▁▁ |
ggplot2. Why?
Advantages of ggplot2
grammar of graphics (Wilkinson, 2005)
The basic idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:
data
aesthetic mapping
geometric object
statistical transformations
scales
coordinate system
position adjustments
faceting
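To see how the building blocks listed above combine, here is a sketch that uses each of them once on the built-in mpg data. The particular choices (jitter, a linear smoother, a log x scale, the y limits) are arbitrary and only serve to show where each block goes.

```r
library(ggplot2)

blocks <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +  # data + aesthetic mapping
  geom_point(position = "jitter") +          # geometric object + position adjustment
  stat_smooth(method = "lm", se = FALSE) +   # statistical transformation
  scale_x_log10() +                          # scale
  coord_cartesian(ylim = c(10, 45)) +        # coordinate system
  facet_wrap(~year)                          # faceting
blocks
```

Each block is added with +, and removing any line except the first still leaves a valid plot.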
Just as the minimal sentence in a grammar is a subject, the minimal object in a plot is the data
ggplot(mpg)
In a grammar, you need a verb. In plots, these are the axes
p <- ggplot(mpg, aes(x = displ, y = hwy))
p
But you also need an object. In ggplot, these are the geoms
p + geom_point()
p + geom_smooth()
You can add (+) as many geoms as you wish
p + geom_smooth() + geom_point()
Use aesthetics to add color, fill, size, shape, etc…
p + geom_point(aes(color = class))
p + geom_point(aes(size = cyl))
p + geom_point(aes(size = cyl, color = class))
p + geom_point(aes(shape = fl))
p + geom_point(aes(color = manufacturer, shape = fl, size = cyl))
p + geom_point(aes(color = manufacturer, size = cyl)) + facet_grid(. ~ fl)
A ggplot is made up of
- ggplot(): ggplot(df, ...)
- aesthetics (x, y, color, fill, shape, size, …): ggplot(df, aes(dimension = variable))
- geom_*: + geom_line()
- geoms inherit the aes of the plot if not specified
- aes vary with the data

And then you can change how things look and behave:
- coordinate functions (changing the axis appearance and type)
- scale functions (changing the appearance of the geoms)
- theme functions (changing the appearance of the plot itself)
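A sketch of one function from each of the three families applied to the same plot. The specific functions (coord_flip(), scale_color_brewer(), theme_minimal()) are just one possible pick among many.

```r
library(ggplot2)

styled <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  coord_flip() +                           # coordinate function: swap the axes
  scale_color_brewer(palette = "Set2") +   # scale function: recolor the points
  theme_minimal()                          # theme function: restyle the plot itself
styled
```

None of these change the data or the mapping; they only change how the result is drawn.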
Plot types depend on the variable type
p <- ggplot(mpg, aes(drv))
p + geom_bar()
p + geom_bar(aes(color = drv))
p + geom_bar(aes(fill = drv))
p + geom_bar(aes(fill = class))
p + geom_bar(aes(fill = class), position = position_dodge())
p + geom_bar(aes(fill = class), position = position_fill())
p <- ggplot(mpg, aes(hwy))
p + geom_histogram()
p + geom_histogram(bins = 10)
p + geom_histogram(bins = 100)
p + geom_dotplot(binwidth = 0.5)
p + geom_density()
p + geom_density(adjust = 3)
p + geom_density(adjust = 0.5)
if both variables are continuous, the natural choice is a scatter plot
p <- ggplot(mpg, aes(x = cty, y = hwy))
p + geom_point()
still, you might just want to show the general tendency
p + geom_smooth()
or both
p + geom_smooth() + geom_point()
one variable discrete, the other continuous (note: it needs a summarise())
mpg %>%
  group_by(manufacturer) %>%
  summarise(n = n()) %>%
  ggplot(aes(manufacturer, n)) +
  geom_col()
the above could have been easily done with geom_bar (that counts for us)
mpg %>% ggplot(aes(manufacturer))+ geom_bar()
but columns give you more options, since now you condition on a proper variable (n). For instance: order by n
mpg %>%
  group_by(manufacturer) %>%
  summarise(n = n()) %>%
  ggplot(aes(reorder(manufacturer, -n), n)) +
  geom_col()
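What reorder() does here can be checked on a toy example: it returns a factor whose levels are sorted by the second argument, and ggplot places discrete values on the axis in level order. The three-row data frame below is only for illustration.

```r
# Toy counts, for illustration only
counts <- data.frame(
  manufacturer = c("audi", "dodge", "honda"),
  n = c(18, 37, 9)
)

# Levels are sorted by increasing -n, i.e. by decreasing n
ordered <- reorder(counts$manufacturer, -counts$n)
levels(ordered)
```

Dropping the minus sign sorts the bars from smallest to largest instead.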
boxplots show a distribution, and can do so over the different levels of a categorical variable
mpg %>% ggplot(aes(drv, hwy))+ geom_boxplot()
boxplots are bulky and show only summary statistics. Want the full distribution? Use violins
mpg %>% ggplot(aes(drv, hwy))+ geom_violin()
remember: all is modular. We can always add color, fill…
mpg %>% ggplot(aes(drv, hwy, color = drv, fill = drv))+ geom_violin()
remember: all is modular. …facets
mpg %>% ggplot(aes(drv, hwy, color = drv, fill = drv))+ geom_violin()+ facet_grid(.~year)
if both variables are categorical, you can count their cross-tabulation
mpg %>% ggplot(aes(fl, drv))+ geom_count()
two variables define the x,y grid. A third defines the color of the cell. Here: number of cars by year and drive (note: usually requires summarise())
mpg %>%
  group_by(year, drv) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = drv, y = year, fill = n)) +
  geom_tile()
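The same tile layout can show a summary of another variable in each cell. A sketch for mean city consumption (cty) by year and drive; treating year as a factor is an assumption made here so the two years become two rows of tiles, and the name mean_cty is made up for illustration.

```r
library(dplyr)
library(ggplot2)

# One row per (year, drv) cell, holding the mean city mileage
cty_by_year_drv <- mpg %>%
  group_by(year, drv) %>%
  summarise(mean_cty = mean(cty), .groups = "drop")

heat <- ggplot(cty_by_year_drv,
               aes(x = drv, y = factor(year), fill = mean_cty)) +
  geom_tile()
heat
```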